

Dragged Kicking and Screaming:
Source Multicore


Tom Leonard, Valve 
9 March 2007 


VALVE 


WWW.GDCONF.COM 









Multicore 

Most significant development since consumer 3D

Explicit parallelism

Hardware problem becoming a software problem; will require new techniques








Introduction 

® The decisions faced with multiple cores 
® How we are approaching multiple cores 
® Algorithms and paradigms 

Goals

® Integrate multicore across Valve’s business

Expose to game programmers, licensees and MOD authors

® Scale to cores without recompile
® Create value beyond framerate

Apply cores to new gameplay

Challenges

® Games want maximal CPU utilization
® Games are inherently serial
® Decades of experience in single threaded optimization
® Millions of lines of code written for single threading

Strategies

® Threading model
® Threading framework

Threading Models

® Fine grained threading
® Coarse threading
® Hybrid threading

Diving In

Client

® User input
® Rendering
® Graphics simulation

Server

® AI
® Physics
® Game logic

® Benefits: forced to confront systems that are not thread safe or not thread efficient

Discoveries

® Problem: shared data access

® Global data
® Static data (optimizations/function local state)
® Singleton objects

Discoveries

® Problem: shared data access
® Thread safety is easy!

Slap on a mutex/critical section

Discoveries

® Problem: shared data access
® Bad thread safety is easy!

Slap on a mutex/critical section

The simple thing is the worst thing

® Mutexes are terrible
® Excessive waits
® Error prone
® Fail to scale

Establish slow but stable baseline

Discoveries

® Efficient thread safety

No synchronization (“wait-free”)

® Each thread has a private copy of all the data needed to perform operation:

® Threads working on independent problems
® Replace globals with thread private data
® Reorient to pipeline

® Example: Source “Spatial Partition”
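
A minimal sketch of the "replace globals with thread private data" idea, written in standard C++ rather than with Source's own facilities. The names t_visitCount, CountVisits and RunWorkers are hypothetical and exist only for this illustration; the point is that each worker touches only its own copy, so no synchronization is needed until the results are collected.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical scratch state that would previously have been one global.
// thread_local gives every thread its own private copy, so no lock is needed.
thread_local int t_visitCount = 0;

// Each worker updates only its private counter, then reports the total
// through a slot that no other thread writes to.
void CountVisits(int nItems, int *pResult)
{
    t_visitCount = 0;
    for (int i = 0; i < nItems; ++i)
        ++t_visitCount;
    *pResult = t_visitCount;
}

// Runs nThreads workers and returns the per-thread totals.
std::vector<int> RunWorkers(int nThreads, int nItems)
{
    std::vector<int> results(nThreads, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < nThreads; ++i)
        threads.emplace_back(CountVisits, nItems, &results[i]);
    for (auto &t : threads)
        t.join();  // join is the only synchronization point
    return results;
}
```

Because no thread ever reads another thread's counter, every total is exact no matter how the threads are scheduled.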

Discoveries

[Figure: Spatial Partition, with Client Objects and Static Objects]

Discoveries

® Efficient thread safety 

No synchronization (“wait-free”) 

® Better synchronization tools, techniques 


® Analyze data access 

® Example: symbol table using read/write lock 
Decouple using queued function calls 
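
The symbol table example can be sketched with a read/write lock as follows. This is an assumption-laden illustration in standard C++17, not the Source symbol table: CSymbolTableSketch and its methods are hypothetical names, and std::shared_mutex stands in for whatever read/write lock the engine uses. The design relies on lookups vastly outnumbering insertions, so readers share the lock and only the rare insert takes it exclusively.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical symbol table: lookups are frequent, insertions are rare.
class CSymbolTableSketch
{
public:
    // Returns the symbol's id, or -1 if absent. Many readers may hold
    // the shared lock at the same time.
    int Find(const std::string &name) const
    {
        std::shared_lock<std::shared_mutex> lock(m_mutex);
        auto it = m_ids.find(name);
        return it == m_ids.end() ? -1 : it->second;
    }

    // Returns the existing id, or assigns the next free id.
    int AddOrFind(const std::string &name)
    {
        int id = Find(name);  // fast path: read lock only
        if (id != -1)
            return id;

        std::unique_lock<std::shared_mutex> lock(m_mutex);
        // Another thread may have added the name between the two locks;
        // emplace leaves an existing entry untouched and returns it.
        auto result = m_ids.emplace(name, static_cast<int>(m_ids.size()));
        return result.first->second;
    }

private:
    mutable std::shared_mutex m_mutex;
    std::unordered_map<std::string, int> m_ids;
};
```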

Discoveries

® What if you can’t eliminate contention over shared resources?

Results

[Figure: per-core utilization chart, 0% to 100%, showing CLIENT on CORE 0]

Results

® Can approach 2x in contrived maps
® More like 1.2x in real single player
® Applicable to 360 Team Fortress 2

Hybrid threading

® Use the appropriate tool for the job 

® Some systems on cores (e.g. sound) 

® Some systems split internally in a coarse manner 
® Split expensive iterations across cores fine grained 
® Queue some work to run when a core goes idle 


Need strong tools 
Maximal core utilization 

Hybrid threading: Rendering

[Figure: rendering pipeline diagram with labels Render, Skybox, Main View, Monitor, Etc., Scene List]

Hybrid threading: Rendering

® Problems

Per-view scene construction limits opportunity
Arbitrary object type order
Arbitrary code execution

® Simulation and Rendering interleaved

Lazy calculation optimizations

® Iterative Transition: Skeletal Animation

Parallelize lazy calculation triggers 
Refactor bone setup into single pass per view 
Refactor into single pass for all views 
Same pattern for other CPU-intensive stages 

Hybrid threading: Rendering

® Revised pipeline

® Construct scene rendering lists for multiple scenes in parallel (e.g., the world and its reflection in water)
® Overlap graphics simulation
® Compute character bone transformations for all characters in all scenes in parallel
® Allow multiple threads to draw in parallel
® Serialize drawing operations on another core

Threading Tools

Implementing Hybrid Threading

Programmers solve game development problems, not threading problems

Empower all programmers to leverage cores

Operating system: too low level
Compiler extensions (OpenMP): too opaque
Tailored tools: correct abstraction

Tailored tools: Game Threading Infrastructure

® Custom work management system

Intuitive for programmers
Focus on keeping cores busy
Thread pool: N-1 threads for N cores
Support hybrid threading

® Function threading
® Array parallelism
® Queued and immediate execution

Tailored tools: Game Threading Infrastructure

® Goal: make system easy to use, hard to mess up

® Example: compiler generated functors

Uses templates to package up functions and data, point of call looks very similar

Call arrives on other end as if called normally

Saves time, reduces error, encourages experimentation
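
One way to picture compiler generated functors is the sketch below, which uses a variadic template to package a function and its arguments into a deferred call. It is an illustration under assumptions, not Valve's implementation: QueueCall, RunQueuedCalls, g_callQueue and RunServerFrame are hypothetical names, and the queue is drained on the calling thread to keep the example short, where a real system would hand it to a worker.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical stand-in for a work queue serviced by another thread.
std::queue<std::function<void()>> g_callQueue;

// Packages fn and copies of its arguments into a functor. The point of
// call looks almost the same as a direct call: QueueCall(fn, a, b).
template <typename Fn, typename... Args>
void QueueCall(Fn fn, Args... args)
{
    g_callQueue.push([fn, args...]() { fn(args...); });
}

// Drains the queue; each call arrives as if it had been called normally.
void RunQueuedCalls()
{
    while (!g_callQueue.empty())
    {
        std::function<void()> call = std::move(g_callQueue.front());
        g_callQueue.pop();
        call();
    }
}

// Hypothetical function to defer.
int g_lastTicks = 0;
void RunServerFrame(int numTicks) { g_lastTicks = numTicks; }
```

Because the arguments are copied when the call is queued, the caller's locals can go out of scope before the call runs.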

Tailored tools: Game Threading Infrastructure

® One-off push to another core

if ( !IsEngineThreaded() )
    _Host_RunFrame_Server( numticks );
else
    ThreadExecute( _Host_RunFrame_Server, numticks );

Tailored tools: Game Threading Infrastructure

Parallel loop

void ProcessPSystem( CParticleEffect *pEffect );

ParallelProcess( particlesToSimulate.Base(),
                 particlesToSimulate.Count(),
                 ProcessPSystem );
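
A parallel loop of this shape can be approximated in standard C++ as shown below. This is a sketch, not the Source ParallelProcess: ParallelProcessSketch and SumOfSquaresDemo are hypothetical names, and it spawns threads per call instead of using a persistent pool. It does follow the N-1 helpers for N cores idea, with the calling thread doing a share of the work, and it hands out items through an atomic index so no lock is taken.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Applies process() to every element of pItems[0..count).
template <typename T, typename Fn>
void ParallelProcessSketch(T *pItems, std::size_t count, Fn process)
{
    std::atomic<std::size_t> next(0);

    // Each worker repeatedly claims the next unprocessed index.
    auto worker = [&]() {
        for (;;)
        {
            std::size_t i = next.fetch_add(1);
            if (i >= count)
                return;
            process(pItems[i]);
        }
    };

    // hardware_concurrency() may return 0 when unknown; treat that as 1.
    unsigned nCores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> helpers;
    for (unsigned i = 1; i < nCores; ++i)  // N-1 helpers for N cores
        helpers.emplace_back(worker);

    worker();  // the calling thread participates instead of idling
    for (auto &t : helpers)
        t.join();
}

// Usage: square each element in place, then sum on one thread.
int SumOfSquaresDemo()
{
    std::vector<int> v = {1, 2, 3, 4, 5};
    ParallelProcessSketch(v.data(), v.size(), [](int &x) { x *= x; });
    int sum = 0;
    for (int x : v)
        sum += x;
    return sum;
}
```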

Tailored tools: Game Threading Infrastructure

® Queue up a bunch of work items, wait for them to complete

BeginExecuteParallel();
ExecuteParallel( gpParticleSystem, &CParticleSystem::Update, time );
ExecuteParallel( &UpdateRopes, time );
EndExecuteParallel();

® Low level APIs for the brave
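
The "queue up work items, then wait" pattern can be imitated with std::async, as in the sketch below. This is only an approximation under stated assumptions: CParallelBatchSketch and the two demo functions are hypothetical, the queued functions must return void, and std::async with std::launch::async starts a thread per item, where a tailored system would feed a fixed thread pool.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <vector>

// Collects work items and blocks in End() until all have completed.
class CParallelBatchSketch
{
public:
    // Starts fn(args...) asynchronously; fn must return void.
    template <typename Fn, typename... Args>
    void Execute(Fn fn, Args... args)
    {
        m_pending.push_back(std::async(std::launch::async, fn, args...));
    }

    // Waits for every queued item, like the end of the parallel batch.
    void End()
    {
        for (auto &f : m_pending)
            f.get();
        m_pending.clear();
    }

private:
    std::vector<std::future<void>> m_pending;
};

// Hypothetical independent systems updated in the same frame.
std::atomic<int> g_particlesTime(0);
std::atomic<int> g_ropesTime(0);
void UpdateParticlesDemo(int t) { g_particlesTime = t; }
void UpdateRopesDemo(int t) { g_ropesTime = t; }
```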

Contention

® What if you can’t eliminate contention over shared resources?

® Example: Allocator

Heavily used

® Multiple pools of fixed sized blocks with a custom spin lock mutex per-pool

Mutex limiting scale

Didn’t want per-thread allocators

Contention

® Lock-free algorithms

No thread can block system regardless of scheduling or state

Under the hood of all services and data structures

Relies on atomic write instructions, “compare-and-swap”
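
As a small illustration of how compare-and-swap is typically used, the sketch below raises a shared value to at least a candidate without taking a lock. It uses std::atomic from standard C++ rather than any engine API, and AtomicMax is a hypothetical name chosen for this example.

```cpp
#include <atomic>
#include <cassert>

// Raises target to at least candidate. No thread ever blocks another:
// a thread that loses the race simply observes the new value and retries.
void AtomicMax(std::atomic<int> &target, int candidate)
{
    int observed = target.load();
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate))
    {
        // On failure, compare_exchange_weak stores the current value
        // into observed, so the loop condition is re-evaluated against it.
    }
}
```

The loop ends either because the swap succeeded or because another thread already stored a value at least as large.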

Contention

bool CompareAndSwap( int *pDest, int newValue, int oldValue )
{
    Lock( pDest );
    bool success = false;
    if ( *pDest == oldValue )
    {
        *pDest = newValue;
        success = true;
    }
    Unlock( pDest );
    return success;
}

Contention

bool CompareAndSwap( int *pDest, int newValue, int oldValue )
{
    __asm
    {
        mov eax, oldValue
        mov ecx, pDest
        mov edx, newValue
        lock cmpxchg [ecx], edx
        mov eax, 0
        setz al
    }
}

Contention

® Use lock-free algorithm in allocator

Replace mutex and traditional free list per-pool with a lock-free list per-pool

Windows API/XDK SList

Lock-free example: singly linked list

® Compare-and-swap

“If head is equal to what I think it is, assign with my new head”

ABA Problem: is it the same head?

Use a serial number as a discriminating field

Lock-free example: singly linked list

class CSList
{
public:
    CSList();

    void Push( SListNode_t *pNode );
    SListNode_t *Pop();
    SListNode_t *Detach();
    int Count() const;

private:
    SListHead_t m_Head;
};

Lock-free example: singly linked list

struct SListNode_t
{
    SListNode_t *pNext;
};

union SListHead_t
{
    struct Value_t
    {
        SListNode_t *pNext;
        int16 iDepth;
        int16 iSequence;
    } value;
    int64 value64;
};

Lock-free example: singly linked list

void Push( SListNode_t *pNode )
{
    SListHead_t oldHead, newHead;
    for ( ;; )
    {
        oldHead.value64 = m_Head.value64;
        newHead.value.iDepth = oldHead.value.iDepth + 1;
        newHead.value.iSequence = oldHead.value.iSequence + 1;
        newHead.value.pNext = pNode;
        pNode->pNext = oldHead.value.pNext;

        if ( ThreadInterlockedAssignIf64( &m_Head.value64, newHead.value64, oldHead.value64 ) )
        {
            return;
        }
    }
}





Lock-free example: singly linked list

® Lock-free list exceptionally useful

Keep pools of context structures when impractical to give every thread a context

Efficiently gather results of a parallel process for later handling

Build up lists of data to operate on using Push(), then use Detach() (a.k.a. “Flush”) to grab the data in another thread in a single operation
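
The Push, Pop and Detach operations, together with the serial number that guards against the ABA problem, can be sketched in portable C++ as follows. This is not the CSList from the slides: to stay within a single 64-bit std::atomic on any platform, this hypothetical CIndexSListSketch links nodes by array index instead of by pointer and packs a 32-bit index with a 32-bit sequence number. All names here are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

class CIndexSListSketch
{
public:
    static constexpr uint32_t kNil = 0xFFFFFFFFu;  // "no node" marker

    explicit CIndexSListSketch(uint32_t capacity)
        : m_next(capacity), m_head(Pack(kNil, 0)) {}

    // Links node in front of the current head.
    void Push(uint32_t node)
    {
        uint64_t oldHead = m_head.load();
        for (;;)
        {
            m_next[node].store(Index(oldHead));
            uint64_t newHead = Pack(node, Sequence(oldHead) + 1);
            // Fails, refreshing oldHead, if the index or sequence changed.
            if (m_head.compare_exchange_weak(oldHead, newHead))
                return;
        }
    }

    // Removes and returns the head node, or kNil when empty.
    uint32_t Pop()
    {
        uint64_t oldHead = m_head.load();
        for (;;)
        {
            uint32_t node = Index(oldHead);
            if (node == kNil)
                return kNil;
            uint64_t newHead = Pack(m_next[node].load(), Sequence(oldHead) + 1);
            if (m_head.compare_exchange_weak(oldHead, newHead))
                return node;
        }
    }

    // Takes the whole chain in one swap; walk the result with Next().
    uint32_t Detach()
    {
        uint64_t oldHead = m_head.load();
        while (!m_head.compare_exchange_weak(
                   oldHead, Pack(kNil, Sequence(oldHead) + 1)))
        {
        }
        return Index(oldHead);
    }

    uint32_t Next(uint32_t node) const { return m_next[node].load(); }

private:
    static uint64_t Pack(uint32_t index, uint32_t seq)
    {
        return (uint64_t(seq) << 32) | index;
    }
    static uint32_t Index(uint64_t head) { return uint32_t(head); }
    static uint32_t Sequence(uint64_t head) { return uint32_t(head >> 32); }

    std::vector<std::atomic<uint32_t>> m_next;  // per-node link
    std::atomic<uint64_t> m_head;               // index + sequence
};
```

The sequence number changes on every successful update, so a head that was popped and pushed back between a thread's read and its compare-and-swap no longer compares equal. The counter is finite and can wrap, which a production implementation would need to account for.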

Example

extern Vector trace_start;
extern Vector trace_end;

struct cbrush_t
{
    int contents;
    unsigned short numsides;
    unsigned short firstbrushside;
    int checkcount; // to avoid repeated testings
};

void BeginTrace()
{
    g_CModelMutex.Lock();
    ++s_nCheckCount;
}

Example

struct TraceInfo_t
{
    Vector m_start;
    Vector m_end;
    // etc...
    CVisitBitVec m_BrushVisits;
};

CTraceInfoPool g_TraceInfoPool;

TraceInfo_t *BeginTrace()
{
    TraceInfo_t *pTraceInfo;
    if ( !g_TraceInfoPool.PopItem( &pTraceInfo ) )
    {
        pTraceInfo = new TraceInfo_t;
    }
    return pTraceInfo;
}

Lock-free algorithms

® Thread pool work distribution queue 

Derived from HL2 asynchronous I/O queue 
® Designed for one provider, one consumer 
Simple prioritized queue with mutex 
® Arbitrary priority 
One queue for all threads 

Lock-free algorithms

® Solutions

Use lock-free queue (Fober et al.)

Rework interface to fixed priorities, one queue per-priority

® Interfaces critical

Queues per core in addition to a shared queue

Use atomic operations to get “ticket”, actual work done may differ

Lock-free algorithms

Locks permit a stable reality

Lock-free permits reality to change instruction to instruction

Leverage inference rather than locks to know part of the system is stable

Wait-free is always better

Looking Forward

® Why so much up-front investment?

® Steam

® Communicate with customers
® Tap markets not available via retail

Dramatic change is underway

® Core count doubles every 18 months
® CPU/GPU/PPU/AIPU/etc not the future
® Many homogeneous cores
® Division of computing power a software problem





Call to action 


® Build or acquire strong tools, new techniques
® Embrace lock-free mechanisms to move work and data to and from wait-free code
® Prepare for decomposition of features over many cores
® Use accessible solutions to empower all programmers, not just systems programmers
® Support even higher level threading framed in terms of game problems

Summary

® Started with a stable but bad threading
® Iteratively eliminated bad cases using a variety of techniques, usually lock-free
® During iterations, expanded toolset to meet newly discovered needs
® Focused on ease-of-use for other programmers
® Now being applied by others at higher levels

In Source SDK this summer
Contact: tom_gdc@valvesoftware.com